2025 Data Fest

Introduction to R, Text Mining, and Sentiment Analysis

2025-09-23

About Me

I’m the Data Equity and Innovation Supervisor for Paid Leave Oregon, where I lead a team of data analysts in transforming complex data into actionable insights.


Outside of work, I explore data through various personal projects that incorporate analytics, visualization, and storytelling. Check out my blogs, dashboards, and talks to see more.

Agenda

  • Introduction to R & R Studio

  • Taylor Data

  • Text Mining

  • Sentiment Analysis

Introduction to R & R Studio

What is R

R is a statistical programming language that’s incredibly powerful for working with data.


Unlike Python, which is built around objects, R is based on functions. For our purposes, that means we’ll be calling functions (like \(F(x)=Y\)) to transform our data rather than attaching properties to objects.
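As a quick base-R illustration of this function-first style (nothing to install; the variable names are arbitrary), each transformation is a call that takes data in and returns new data:

```r
# A numeric vector
x <- c(3, 1, 2)

# Each transformation is a function call, F(x) = Y,
# rather than a method attached to an object
sorted <- sort(x)   # 1 2 3
m <- mean(x)        # 2
```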

Benefits of R

  • R is open source and freely available.

  • R has an extensive and coherent set of tools for statistical analysis.

  • R has an extensive and highly flexible graphical facility capable of producing publication-quality figures.

  • R has an extensive support network with numerous online and freely available documents.

  • R has an expanding set of freely available ‘packages’ to extend R’s capabilities. 1

What is R Studio

R Studio is an Integrated Development Environment (IDE), similar to VS Code or Sublime.


R Studio provides a more user-friendly interface, incorporating the R Console, a script editor and other useful functionality (like R markdown and GitHub integration). You can find more information about RStudio here.1

Packages

Most of the work we’ll do relies on packages, which are basically toolkits. To use one, you will first need to install it with the base R function install.packages():

install.packages("package_name")

After installing a package, you can load it into your current session with library():

library("package_name")

Tips

When in doubt about what a function does, or what is in a package, you can type ?function_name() or ?package_name in the R Studio console to open the description page in the Help tab.

CRAN

CRAN is the Comprehensive R Archive Network. It’s where R packages are stored, tested, and shared with the community. If you install a package in R, you’re usually pulling it from CRAN.

Live Demo

  • Create and load R projects

  • R Studio Landscape (Console, help, viewer, environment, files, packages)

  • Global Options (appearance & layout)

  • .R files, Quarto files, render code, comments

  • R Syntax (assignment <-, concatenate c(), extract values [], extract fields $, sequence :, loops)

  • Plots & ggplots
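The syntax items in the demo list can be sketched in a few lines of base R (the names here are arbitrary examples, not from the demo itself):

```r
# Assignment (<-) and concatenation (c())
scores <- c(4, 8, 15, 16)

# Extract values by position ([]) and fields by name ($)
scores[2]                                   # 8
df <- data.frame(album = c("Red", "1989"),
                 year  = c(2012, 2014))
df$year                                     # 2012 2014

# Sequences (:) and loops
for (i in 1:3) {
  print(i * 2)
}
```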

Review Questions

  1. What is the difference between R and R Studio?

  2. How do you find documentation on a package or function?

  3. How do you find the version of R you are currently using?

  4. What is the assignment operator?

Taylor Swift Data

Taylor Package

The Taylor package is a comprehensive resource for data on Taylor Swift songs. Data comes from ‘Genius’ (lyrics) and ‘Spotify’ (song characteristics).

Useful links: taylor, taylor repo

Install & Load Data


install.packages("taylor")
library("taylor")

Taylor Data

There are three main data sets:

  • taylor_album_songs: which includes lyrics and audio features from the Spotify API for all songs on Taylor’s official studio albums.

  • taylor_all_songs: Taylor’s entire discography.

  • taylor_albums: Summarizes Taylor’s album release history.

Taylor Album Songs

head(taylor_album_songs)
# A tibble: 6 × 29
  album_name   ep    album_release track_number track_name      artist featuring
  <chr>        <lgl> <date>               <int> <chr>           <chr>  <chr>    
1 Taylor Swift FALSE 2006-10-24               1 Tim McGraw      Taylo… <NA>     
2 Taylor Swift FALSE 2006-10-24               2 Picture To Burn Taylo… <NA>     
3 Taylor Swift FALSE 2006-10-24               3 Teardrops On M… Taylo… <NA>     
4 Taylor Swift FALSE 2006-10-24               4 A Place In Thi… Taylo… <NA>     
5 Taylor Swift FALSE 2006-10-24               5 Cold As You     Taylo… <NA>     
6 Taylor Swift FALSE 2006-10-24               6 The Outside     Taylo… <NA>     
# ℹ 22 more variables: bonus_track <lgl>, promotional_release <date>,
#   single_release <date>, track_release <date>, danceability <dbl>,
#   energy <dbl>, key <int>, loudness <dbl>, mode <int>, speechiness <dbl>,
#   acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
#   tempo <dbl>, time_signature <int>, duration_ms <int>, explicit <lgl>,
#   key_name <chr>, mode_name <chr>, key_mode <chr>, lyrics <list>

Reactable

This package creates a data table with sorting and pagination. The default table is an HTML widget that can be used in RMD and Shiny applications, or viewed from an R console.


install.packages("reactable")
library("reactable")

Reactable Taylor Data

reactable(
  taylor_album_songs,
  wrap = FALSE,
  defaultPageSize = 4,
  defaultColDef = colDef(minWidth = 300))

Extract Lyrics

Let’s look at one song:

song1 <- data.frame(taylor_all_songs$lyrics[1])
head(song1)

Dplyr

A package for working with data frames that makes it easy to filter, sort, group, and summarize data using simple, readable functions.


install.packages("dplyr")
library(dplyr)

Tidy Text

A package that helps turn text (like lyrics or survey responses) into tidy data frames, so you can analyze words, sentiments, and topics with the same tools you use for numbers.


install.packages("tidytext")
library(tidytext)

Tidyr

A package for reshaping data, used to make messy data “tidy” by separating, combining, or pivoting columns so each row is an observation and each column is a variable.


install.packages("tidyr")
library(tidyr)

Define Stop Words

A built-in dataset of very common words like the, and, of that usually don’t add much meaning. We remove these so the analysis focuses on more meaningful words.

data("stop_words")

Define Songs Words

A custom list of filler or vocalization words (like ooh, ah, la) that appear in lyrics but don’t carry much meaning. We can filter these out so they don’t distract from the main analysis.

song_words <- data.frame(
  word = c(
    "ooh", "di", "eh", "ah", "la", 
    "ha","da", "uh", "huh", "whoa",
    "ba", "hoo")
)

Define Negations

A dataset of words like not, no, never, without that flip the meaning of the words around them. This is useful in sentiment analysis because “not happy” is very different from “happy.”

negations <- data.frame(
  word = c(
      "not", "no", "never", "none", "nobody",
      "nothing", "neither", "nowhere", "cannot",
      "without", "hardly", "barely", "scarcely")
)

Define Data

Data <- lapply(1:nrow(taylor_album_songs), function(i) {
  song  <- taylor_album_songs[i, ]
  lines <- song$lyrics[[1]]

  data.frame(
    doc         = i,
    track       = song$track_name,
    album       = song$album_name,
    line_number = seq_len(nrow(lines)),
    lyrics      = lines$lyric
  )
}) |> 
  bind_rows()

Review Data
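The review step isn’t shown in code on this slide; one hedged sketch for inspecting the assembled data frame (assuming the Data object built above) is:

```r
library(dplyr)

# Peek at the first few lyric lines and check the overall shape
head(Data)
glimpse(Data)          # column names and types
n_distinct(Data$doc)   # number of songs in the corpus
```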

Review Questions

  1. What is the importance of stop words?

  2. Why are negations important to consider when analyzing sentiment?

  3. What does the doc column represent in the dataset we created from Taylor’s lyrics?

  4. Why might we want to keep each line of a song instead of collapsing the whole song into one string?

Introduction to Text Mining

Key Definitions

These definitions might feel elementary at first, but that’s the point. The more clearly you understand these simple ideas, the easier it will be to make sense of the more complex modeling steps later.

  • Word: a single word, the smallest unit of analysis.

  • Text: the written content inside a document (the lyrics of a single song).

  • Document: a unit of text (a single song).

  • Corpus: the full collection of texts (all song lyrics).

Key Definitions Continued

  • Vocabulary: the unique set of words across all documents in a set.

  • N-grams: sequences of n consecutive words (1 = unigram, 2 = bigram, 3 = trigram), which we can use to look for themes across the corpus.

  • Stop Words: Stop words are common words (like the, and, is, in, at, on) that usually don’t add much meaning for text analysis.

  • Sentiment: Sentiment refers to the emotional tone or attitude expressed in text.
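As a tiny base-R illustration of the n-gram idea (no packages needed; the example phrase is arbitrary), bigrams are just consecutive word pairs:

```r
words <- c("shake", "it", "off")

# Unigrams are the words themselves; bigrams pair each word
# with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams  # "shake it" "it off"
```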

Unnest Tokens

This splits a column into tokens, flattening the table into one-token-per-row.

Let’s start by tokenizing our data set, keeping stop words and song words.

unigram_all <- Data |>
  unnest_tokens(word, lyrics, token = "words") 

Review unigram_all

Notice that there are 132,140 total words in our Corpus.
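Since each row of the tokenized tibble is one word, the corpus size can be confirmed by counting rows (assuming the unigram_all object from the previous slide):

```r
# One token per row, so the row count is the total word count
nrow(unigram_all)  # 132,140 in this corpus
```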

Vocabulary (unigram_all)

When we count the words we see that there are only 5,146 words in this vocabulary, with the most common words being you, I, the, and and.

unigram_all_vocab <- unigram_all |>
  count(word, sort = TRUE)

Ggplot

To visualize the most common words in our vocabulary we are going to use the ggplot2 package.

install.packages("ggplot2")
library("ggplot2")


Visualize Plot

Then we will use the following code to visualize the most used words in our vocabulary.

unigram_all_vocab |>
  filter(n > 1000) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

Most Used Words (all)

Removing Stop & Song Words

Since we are not able to derive much insight from that initial analysis, let’s try again, this time removing stop words and our custom song words with anti_join().

unigram <- Data |>
  unnest_tokens(word, lyrics, token = "words") |>
  anti_join(stop_words) |>
  anti_join(song_words)

Anti Join Unigram

Most Common Word (Code)

unigram |>
  count(word, sort = TRUE) |>
  filter(n > 100) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

Most Common Word (Graph)

Most Common Word by Album

unigram |> 
  count(album, word, sort = TRUE) |>
  group_by(album) |>
  slice_max(n, n = 10) |>
  mutate(word = reorder_within(word, n, album)) |>
  ggplot(aes(n, word)) +
  geom_col() +
  scale_y_reordered() +
  labs(y = NULL) +
  facet_wrap(~ album, scales = "free_y")

Most Common Word by Album

Bigram (Code)

bigram <- Data |>
  select(doc, album, lyrics) |>
  mutate(lyrics = tolower(lyrics)) |>
  unnest_tokens(
    bigram, 
    lyrics, 
    token = "ngrams", 
    n = 2) |>
  separate(
    col = bigram,
    sep = " ",
    into = c("w1", "w2"),
    remove = FALSE
  ) |>
  filter(!w1 %in% stop_words$word) |>
  filter(!w2 %in% stop_words$word) |>
  filter(!w1 %in% song_words$word) |>
  filter(!w2 %in% song_words$word) |>
  filter(!is.na(bigram))

Bigram (Table)

Most Common Bigram (Code)

bigram |>
  count(bigram, sort = TRUE) |>
  filter(n > 13) |>
  mutate(bigram = reorder(bigram, n)) |>
  ggplot(aes(n, bigram)) +
  geom_col() +
  labs(y = NULL)

Most Common Bigram (Graph)

Most Common Bigram by Album

bigram |> 
  count(album, bigram, sort = TRUE) |>
  group_by(album) |>
  slice_max(n, n = 10) |>
  mutate(word = reorder_within(bigram, n, album)) |>
  ggplot(aes(n, word)) +
  geom_col() +
  scale_y_reordered() +
  labs(y = NULL) +
  facet_wrap(~ album, scales = "free_y")

Most Common Bigram by Album

Review

  1. What’s the difference between a document, text, and the corpus in this project?

  2. Why were stop words and song words removed?

  3. Your facet plot is not sorted within each album. What two helpers fix it?

  4. Why does slice_max(n, n = 10) differ from filter(n > 10) after count()?

Sentiment Analysis

Sentiment Dictionaries

The tidytext package provides access to several sentiment lexicons.


install.packages("tidytext")
library("tidytext")

Three general-purpose lexicons are included with tidytext. We will be using two.

Lexicons - Bing

bing: words are categorized in a binary fashion as either positive or negative.

get_sentiments("bing")
# A tibble: 6,786 × 2
   word        sentiment
   <chr>       <chr>    
 1 2-faces     negative 
 2 abnormal    negative 
 3 abolish     negative 
 4 abominable  negative 
 5 abominably  negative 
 6 abominate   negative 
 7 abomination negative 
 8 abort       negative 
 9 aborted     negative 
10 aborts      negative 
# ℹ 6,776 more rows

Lexicons - Afinn

afinn: assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

get_sentiments("afinn")
# A tibble: 2,477 × 2
   word       value
   <chr>      <dbl>
 1 abandon       -2
 2 abandoned     -2
 3 abandons      -2
 4 abducted      -2
 5 abduction     -2
 6 abductions    -2
 7 abhor         -3
 8 abhorred      -3
 9 abhorrent     -3
10 abhors        -3
# ℹ 2,467 more rows

Important Note

These dictionaries were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on. 1

Tidy Albums

Sentiment Trajectory for Each Album

(Cont)

  • Sentiment over time by album
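The code for this slide is left open; one hedged sketch of an album-level trajectory (assuming the unigram tokens built earlier and the afinn lexicon) scores each line and plots the scores across line numbers:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Sum the afinn scores of the words on each line of each album
trajectory <- unigram |>
  inner_join(get_sentiments("afinn"), by = "word") |>
  group_by(album, line_number) |>
  summarize(score = sum(value), .groups = "drop")

# One panel per album: bars above zero are positive lines
ggplot(trajectory, aes(line_number, score)) +
  geom_col() +
  facet_wrap(~ album, scales = "free_x")
```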

Bing Join
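The join on this slide isn’t shown in code; a hedged sketch (assuming the unigram tokens from the text-mining section) attaches the bing label to every matching word:

```r
library(dplyr)
library(tidytext)

# Keep only words that appear in the bing lexicon, tagging each
# with its positive/negative label
unigram_bing <- unigram |>
  inner_join(get_sentiments("bing"), by = "word")

head(unigram_bing)
```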

Most Common +/- Songs
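One way to rank songs by net sentiment (a sketch, assuming the same unigram tokens; song_sentiment is a hypothetical name) is to count positive minus negative words per track:

```r
library(dplyr)
library(tidytext)
library(tidyr)

song_sentiment <- unigram |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(track, sentiment) |>
  # One column per sentiment label, zero where a label is absent
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(net = positive - negative) |>
  arrange(desc(net))

head(song_sentiment)   # most positive songs first
```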

Word Cloud (code)

The comparison.cloud() function from the wordcloud package plots the most frequent positive and negative words in contrasting shades, with each word sized by its frequency.

library(wordcloud)
library(RColorBrewer)

unigram %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  reshape2::acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

Word Cloud (graph)

Negation

Negation words can flip the meaning of what follows them, so a word-level join can misread phrases like “not happy” as positive. One common fix is to find bigrams whose first word is a negation and reverse the sentiment score of the second word.
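A hedged sketch of that bigram approach, using the negations data frame defined earlier and the afinn lexicon (the negated and adjusted names are illustrative):

```r
library(dplyr)
library(tidytext)
library(tidyr)

negated <- Data |>
  unnest_tokens(bigram, lyrics, token = "ngrams", n = 2) |>
  separate(bigram, into = c("w1", "w2"), sep = " ") |>
  # Keep bigrams whose first word is a negation
  filter(w1 %in% negations$word) |>
  # Score the second word with afinn, then flip its sign
  inner_join(get_sentiments("afinn"), by = c("w2" = "word")) |>
  mutate(adjusted = -value)

head(negated)
```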

Review

Conclusion & Resources

Goodbye, and thank you!